17 research outputs found

    Symmetric L-graphs

    In this paper we characterize symmetric L-graphs, which are either Kronecker products of two cycles or Gaussian graphs. Vertex-symmetric networks have the property that the communication load is uniformly distributed over all the vertices, so that there is no point of congestion. A stronger notion of symmetry, edge symmetry, requires that every edge in the graph looks the same. This property ensures that the communication load is uniformly distributed over all the communication links, so that there is no congestion at any link. Peer Reviewed
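As a rough illustration of one of the two families named in this abstract, the sketch below builds the Kronecker (tensor) product of two cycles using the standard graph-theoretic definition, in which (i, j) is adjacent to (i±1, j±1). The function names are hypothetical, and the paper's own construction of L-graphs may be stated differently.

```python
def kronecker_cycles(m, n):
    """Adjacency sets of the Kronecker (tensor) product of cycles C_m and C_n.

    Assumption: the standard definition, where (i, j) ~ (i', j') iff
    i ~ i' in C_m and j ~ j' in C_n. Intended for m, n >= 3.
    """
    adj = {}
    for i in range(m):
        for j in range(n):
            adj[(i, j)] = {
                ((i + di) % m, (j + dj) % n)
                for di in (-1, 1)
                for dj in (-1, 1)
            }
    return adj


def count_components(adj):
    """Number of connected components, found by depth-first search."""
    seen = set()
    components = 0
    for start in adj:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            v = stack.pop()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
    return components
```

Every vertex has degree 4 by construction. The product is connected only when at least one of the two cycles has odd length; with two even cycles it splits into two components.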

    Peripheral twists for torus topologies with arbitrary aspect ratio

    A torus is a common topology used in supercomputer networks. Asymmetric tori suffer from resource usage imbalance, which translates to reduced performance. Twisted tori employ a twist in the peripheral links of one or more dimensions to improve the topological parameters and overall performance of asymmetric networks. 2D and 3D twisted tori with aspect ratios 2:1 and 2:1:1 have been studied in detail. However, commercial machines do not necessarily employ those aspect ratios. In this work we present an early study of the effect of peripheral link twisting in multidimensional twisted tori with arbitrary aspect ratios. We observe that, in the general case, it is impossible to find a specific twist that minimizes all the interesting topological parameters of the network. We also introduce a requirement for the use of several twists in multidimensional tori with adaptive routing. Postprint (author's final draft)
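The effect of a peripheral twist on distance parameters can be explored with a small breadth-first search. The sketch below is only an illustration under stated assumptions: it models a 2D torus in which the wrap-around links of the Y dimension are shifted by `twist` columns, and the function names and this particular twist convention are hypothetical rather than taken from the paper.

```python
from collections import deque


def twisted_torus_neighbors(x, y, width, height, twist):
    """Neighbours of (x, y); the Y wrap-around links are shifted by `twist` columns."""
    right = ((x + 1) % width, y)
    left = ((x - 1) % width, y)
    up = (x, y + 1) if y + 1 < height else ((x + twist) % width, 0)
    down = (x, y - 1) if y > 0 else ((x - twist) % width, height - 1)
    return [right, left, up, down]


def distance_metrics(width, height, twist):
    """Return (diameter, average distance over ordered pairs of distinct nodes)."""
    nodes = [(x, y) for x in range(width) for y in range(height)]
    diameter = 0
    total = 0
    for src in nodes:
        # Plain BFS from every source; fine for small networks.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            cur = queue.popleft()
            for nxt in twisted_torus_neighbors(*cur, width, height, twist):
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        diameter = max(diameter, max(dist.values()))
        total += sum(dist.values())
    pairs = len(nodes) * (len(nodes) - 1)
    return diameter, total / pairs
```

For example, comparing `distance_metrics(8, 4, 0)` with `distance_metrics(8, 4, 4)` contrasts a plain 2:1 torus with the same network under a twist of half its width; sweeping `twist` for other aspect ratios shows how diameter and average distance respond separately.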

    Efficient routing mechanisms for Dragonfly networks

    High-radix hierarchical networks are cost-effective topologies for large scale computers. In such networks, routers are organized in super nodes, with local and global interconnections. These networks, known as Dragonflies, outperform traditional topologies such as multi-trees or tori, in cost and scalability. However, depending on the traffic pattern, network congestion can lead to degraded performance. Misrouting (non-minimal routing) can be employed to avoid saturated global or local links. Nevertheless, with the current deadlock avoidance mechanisms used for these networks, supporting misrouting implies routers with a larger number of virtual channels. This exacerbates the buffer memory requirements that constitute one of the main constraints in high-radix switches. In this paper we introduce two novel deadlock-free routing mechanisms for Dragonfly networks that support on-the-fly adaptive routing. Using these schemes both global and local misrouting are allowed employing the same number of virtual channels as in previous proposals. Opportunistic Local Misrouting obtains the best performance by providing the highest routing freedom, and relying on a deadlock-free escape path to the destination for every packet. However, it requires Virtual Cut-Through flow-control. By contrast, Restricted Local Misrouting prevents the appearance of cycles thanks to a restriction of the possible routes within super nodes. This makes this mechanism suitable for both Virtual Cut-Through and Wormhole networks. Evaluations show that the proposed deadlock-free routing mechanisms prevent the most frequent pathological issues of Dragonfly networks. As a result, they provide higher performance than previous schemes, while requiring the same area devoted to router buffers. This work has been supported by the Spanish Ministry of Science under contracts TIN2010-21291-C02-02, TIN2012-34557, and by the European HiPEAC Network of Excellence.
The research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP/2007-2013) / ERC Grant Agreement n. ERC-2012-Adg-321253-RoMoL. M. García and M. Odriozola participated in this research work while they were affiliated with the University of Cantabria. Peer Reviewed. Postprint (author's final draft)

    Light NUCA: a proposal for bridging the inter-cache latency gap

    To deal with the “memory wall” problem, microprocessors include large secondary on-chip caches. But as these caches grow, they create a new latency gap between them and fast L1 caches (the inter-cache latency gap). Recently, Non-Uniform Cache Architectures (NUCAs) have been proposed to sustain the size growth trend of secondary caches that is threatened by wire-delay problems. NUCAs are size-oriented, and they were not conceived to close the inter-cache latency gap. To tackle this problem, we propose Light NUCAs (L-NUCAs), which leverage on-chip wire density to interconnect small tiles through specialized networks that convey packets with distributed and dynamic routing. Our design reduces the tile delay (cache access plus one-hop routing) to a single processor cycle and places cache lines at a finer granularity than conventional caches, reducing cache latency. Our evaluations show that, in general, L-NUCA simultaneously improves performance, energy, and area when integrated into either conventional or D-NUCA hierarchies. Postprint (author’s final draft)

    Task mapping in rectangular twisted tori

    Twisted torus topologies have been proposed as an alternative to toroidal rectangular networks, improving distance parameters and providing network symmetry. However, twisting is apparently less amenable to task mapping algorithms of real life applications. In this paper we make an analytical study of different mapping and concentration techniques on 2D twisted tori that try to compensate for the twisted peripheral links. We introduce a performance model based on the network average distance and the detection of the set of links which receive the highest load. The model also considers the amount of local and global communications in the network. Our model shows that the twisted torus can improve latency and maximum throughput over rectangular torus, especially when global communications dominate over local ones and when some concentration is employed. Simulation results corroborate our synthetic model. For real applications from the NPB benchmark suite, the use of the twisted topologies with an appropriate mapping provides overall average application speedups of 2.9%, which increase to 4.9% when concentrated topologies (c = 2) are considered. This work has been supported by the Spanish Ministry of Science under contracts TIN2010-21291-C02-02, TIN-2007-60625, AP2010-4900 and CONSOLIDER Project CSD2007-00050, and by the European HiPEAC Network of Excellence. M. Moreto is supported by a MEC/Fulbright Fellowship. Postprint (author’s final draft)

    CATA: Criticality aware task acceleration for multicore processors

    Managing criticality in task-based programming models opens a wide range of performance and power optimization opportunities in future manycore systems. Criticality aware task schedulers can benefit from these opportunities by scheduling tasks to the most appropriate cores. However, these schedulers may suffer from priority inversion and static binding problems that limit their expected improvements. Based on the observation that task criticality information can be exploited to drive hardware reconfigurations, we propose a Criticality Aware Task Acceleration (CATA) mechanism that dynamically adapts the computational power of a task depending on its criticality. As a result, CATA achieves significant improvements over a baseline static scheduler, reaching average improvements up to 18.4% in execution time and 30.1% in Energy-Delay Product (EDP) on a simulated 32-core system. The cost of reconfiguring hardware by means of a software-only solution rises with the number of cores due to lock contention and reconfiguration overhead. Therefore, novel architectural support is proposed to eliminate these overheads on future manycore systems. This architectural support minimally extends hardware structures already present in current processors, which allows further improvements in performance with negligible overhead. As a consequence, average improvements of up to 20.4% in execution time and 34.0% in EDP are obtained, outperforming state-of-the-art acceleration proposals not aware of task criticality. This work has been supported by the Spanish Government (grant SEV2015-0493, SEV-2011-00067 of the Severo Ochoa Program), by the Spanish Ministry of Science and Innovation (contracts TIN2015-65316, TIN2012-34557, TIN2013-46957-C2-2-P), by Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), by the RoMoL ERC Advanced Grant (GA 321253) and the European HiPEAC Network of Excellence.
The Mont-Blanc project receives funding from the EU’s Seventh Framework Programme (FP7/2007-2013) under grant agreement no 610402 and from the EU’s H2020 Framework Programme (H2020/2014-2020) under grant agreement no 671697. M. Moretó has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship number JCI-2012-15047. M. Casas is supported by the Secretary for Universities and Research of the Ministry of Economy and Knowledge of the Government of Catalonia and the Cofund programme of the Marie Curie Actions of the 7th R&D Framework Programme of the European Union (Contract 2013 BP B 00243). E. Castillo has been partially supported by the Spanish Ministry of Education, Culture and Sports under grant FPU2012/2254. Peer Reviewed. Postprint (author's final draft)
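The Energy-Delay Product quoted in this abstract is, by its usual definition, energy multiplied by execution time. The sketch below, with a hypothetical function name, shows how fractional reductions in the two factors combine for a single run. It is an arithmetic identity only: figures averaged across benchmarks, such as those reported above, need not compose this way.

```python
def edp_reduction(time_reduction, energy_reduction):
    """Fractional EDP reduction of one run.

    Both arguments are fractions in [0, 1), e.g. 0.10 for a 10% reduction.
    Uses EDP = energy * delay, so the remaining EDP is the product of the
    remaining fractions of delay and energy.
    """
    return 1.0 - (1.0 - time_reduction) * (1.0 - energy_reduction)
```

For instance, a run that is 10% faster and uses 20% less energy has an EDP about 28% lower, because the two reductions multiply rather than add.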


